Fault Recovery for Distributed Shared Memory
ion between the user's application and the message passing primitives available on the target systems. Distributed shared memory (DSM) provides users with the abstraction of shared memory on networks of physically distributed machines. This programming model is widely considered to be more intuitive for programmers than message passing. Because DSM systems are implemented on networks of workstations, they provide a relatively low cost environment. As systems scale to hundreds or even thousands of processors, however, the probability of a node failing increases. If no precautions are taken when a node fails, the program must be restarted from the beginning. For long running, large scale parallel programs, the time required to restart the computation from the beginning can be substantial. If instead a program's state is periodically stored in a checkpoint, the program can roll back and restart from the most recent checkpoint. In addition, as part of the debugging process, a faulty program can be restarted from a checkpoint repeatedly. Techniques for checkpointing a single Unix task have been developed [33]. A checkpoint consists of a snapshot of the program's state when the checkpoint is taken. This includes the contents of memory and registers used by the program as well as state information about I/O (for example, the current set of open files and the current position in files that are sequentially accessed). Randomly accessed files can be more problematic [32]. In addition, some aspects of the operating system's state can be difficult to reproduce. For example, Unix process IDs are not recoverable [33]. After a failure occurs and the system is once again stable, the program can be recovered by loading the state information from the most recent checkpoint and restarting execution.
Recording and restarting from a checkpoint in a distributed system is further complicated because the global state of a parallel program can be difficult to determine. Each processor is only capable of storing its own memory and the messages it has sent or received [8]. One approach to checkpointing a distributed system is to have processors coordinate to store a consistent global state [8], [38], [10], [32], [31]. These techniques rely on sending messages between processes to ensure that a globally consistent state exists. When a globally consistent state is recognized, each processor can take a local checkpoint, the complete set of which constitutes a global checkpoint. Alternatively, to avoid the overhead of ensuring a globally consistent state, each processor can independently record a local checkpoint when it is convenient to do so [11], [19], [37], [29], [44], [30], [36]. When the checkpointing system restarts execution from a checkpoint after a failure, it is said to roll back to the checkpoint. Sometimes the rollback of one process will force other processes to roll back because of dependencies between the processors. That may, in turn, cause still more processors, including the failed processor, to roll back. This effect is known as rollback propagation or the domino effect [35]. In the worst case, the program will have to roll back to the beginning of its execution. The problem of rollback propagation when restarting from local checkpoints can be ameliorated by supplementing the checkpoints with a record of all the messages sent by each process and the dependencies that they create. DSM offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines, and promises to be a dominant paradigm for future high-performance computing. DSM systems provide the shared memory abstraction via a run-time system that passes messages to support it.
Because DSM systems employ message passing at the lower levels, it is possible to apply checkpointing techniques for message passing systems directly to the underlying messages, but checkpointing systems that take advantage of the communication patterns of a DSM system can benefit significantly. This paper compares various methods of checkpointing for message passing and distributed shared memory systems. Section 2 describes message passing systems. Section 3 gives an overview of issues related to distributed shared memory systems and several examples of distributed shared memory systems. Section 4 discusses checkpointing in distributed shared memory systems.

2. Message Passing Systems

Processes in a parallel program must communicate to perform meaningful computations. Message passing provides processes with an efficient means of communication. When checkpointing a distributed system, the state of the communication channels between processes must be taken into account, in addition to the state of each of the processes, to guarantee that the checkpoint represents a consistent global state of the computation [8]. A consistent global state is a state the program could have reached through its normal execution. If a parallel program fails and is restarted from a checkpoint that represents a consistent global state, the results of the program will be indistinguishable from the results obtained by running the program without failure. Figure 1 demonstrates the situation that leads to an inconsistent global state. The horizontal lines represent the execution of the processes that make up a parallel program. The arrows between them represent messages between the processes. The short vertical bars show where each process records its state in a local checkpoint.

Figure 1. Neither global checkpoint C1 nor global checkpoint C2 represents a consistent global state. In global checkpoint C1, message m1 has been received, but not sent.
In global checkpoint C2, message m2 has been sent, but not received. A global checkpoint is simply a set of local checkpoints, one in each process. A global checkpoint can be formed by drawing a line connecting one checkpoint in each process (as shown by the dashed lines). If a line drawn connecting all of the local checkpoints in a global checkpoint crosses a message, then either the message was recorded as sent by one process but not received by another, or it was recorded as received in one process but not sent in another. In either case, the global checkpoint is said to be inconsistent. If computation were restarted from such a checkpoint, the program would be in a state that could not have been reached by normal execution. For example, checkpoint C1 is inconsistent because message m1 will be recorded as received by process p2, but not sent by process p1. If the program were to attempt to restart from checkpoint C1, p1 would send p2 an extra copy of message m1. Checkpoint C2 is inconsistent because message m2 will be recorded as sent by process p3, but not received by process p2. If the program attempted to restart from checkpoint C2, process p2 would never receive message m2 from process p3 even though process p3 has sent it. The checkpointing system must take measures to guarantee the state represented by a global checkpoint is consistent before it is used for recovery. It can maintain consistency in one of two ways: it can either force processes to coordinate in such a way that there is always a consistent state when each process saves its individual state [8], [38], [31], [10], or it can let each processor checkpoint independently [37], [20], [19], [11], [44]. If each processor is allowed to checkpoint independently, steps must be taken during recovery to determine the most recent set of local checkpoints that make up a consistent global checkpoint.
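The consistency test just described can be sketched as a small predicate over a candidate global checkpoint. This is an illustrative model, not code from any of the cited systems: processes are named by strings, and each send or receive event carries that process's local event count.

```python
def cut_is_consistent(cut, messages):
    """Check whether a global checkpoint (a 'cut') is consistent.

    cut: dict mapping each process to the local time of its checkpoint.
    messages: list of (sender, send_time, receiver, recv_time) tuples.
    A message that crosses the cut line in either direction makes the cut
    inconsistent, matching the definitions of C1 and C2 in Figure 1.
    """
    for sender, send_t, receiver, recv_t in messages:
        sent_before = send_t <= cut[sender]
        received_before = recv_t <= cut[receiver]
        if sent_before != received_before:   # the message crosses the cut
            return False
    return True

# Figure 1's C1: m1 is recorded as received by p2 but not yet sent by p1.
m1 = ("p1", 3, "p2", 1)
assert not cut_is_consistent({"p1": 1, "p2": 2}, [m1])
# A cut taken after the send on p1 is consistent.
assert cut_is_consistent({"p1": 4, "p2": 2}, [m1])
```

An independent checkpointing system would apply a test like this to progressively older combinations of local checkpoints until a consistent cut is found.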
At least enough information must be kept to determine whether any given global checkpoint is consistent. Conceptually, a simple way to implement independent checkpointing is to have each process checkpoint periodically and retain all the local checkpoints for every process. When an error occurs, the program rolls back to the last consistent global checkpoint. The advantage of this type of system is its low failure free overhead. There is no synchronization between processes to checkpoint, and each process can checkpoint whenever it is convenient. The system only spends a small amount of overhead to track which messages might cause a global checkpoint to be inconsistent. However, it is possible that a program may have to roll back past more than one local checkpoint to arrive at a consistent global checkpoint. In the worst case, the program would have to roll all the way back to the start of computation. Figure 2 shows one case where it is impossible to create a consistent global checkpoint from any of the local checkpoints without going all the way back to the start of computation. As mentioned earlier, this is referred to as the domino effect. Some old checkpoints must be saved because they may be needed if a process is forced to roll back past the most recent checkpoint, but the checkpointing system can save some space by throwing away local checkpoints that can never form a consistent global state [42].

Figure 2. In this case, the only set of local checkpoints that form a consistent global checkpoint is the set of initial checkpoints, {C1,1, C2,1, C3,1}.

If the computation between messages is deterministic, the domino effect can be avoided by logging the messages sent between processes [39]. To recover from a failure, some set of processes roll back to their last set of local checkpoints.
A recovering process then executes as normal except that when it would have received a message, it reads the message from its message log. When a recovering process attempts to send a message that has already been logged, the message is discarded. When a process has exhausted the contents of its log, execution continues as normal. For example, in Figure 2, if all the messages are logged and process p2 fails before local checkpoint C2,4, it can restart from checkpoint C2,3. When the process attempts to send message m6, message m6 is discarded. When it attempts to receive message m8, the recovering version of p2 will replay message m8 from its message log. Messages may be logged synchronously or asynchronously. In a system that logs messages synchronously, the process logging the message is blocked until the entire message is written to stable storage. This is also referred to as pessimistic message logging [39]. Alternatively, messages can be logged asynchronously, allowing the process to continue execution while the message is written to the message log. This technique, also called optimistic message logging, can boost performance significantly if failures are rare [39], but under asynchronous message logging not all old checkpoints can be discarded because it is possible for a processor to fail before a message is completely logged. For example, in Figure 3, if process p2 fails after message m1 is received, but before it is logged, process p1 will have to roll back to checkpoint C1,1 to reproduce message m1. To save space, a garbage collection routine may be used to determine which checkpoints are no longer useful [42].

Figure 3. If message m1 is not logged before process p2 fails, then process p1 will have to roll back to checkpoint C1,1 even though it has more recent checkpoints.
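The replay discipline described above can be sketched as follows. The class and message names are hypothetical, purely for illustration: receives are served from the log until it is exhausted, and sends of already-logged messages are suppressed.

```python
from collections import deque

class RecoveringProcess:
    """Toy model of log-based recovery for a deterministic process."""
    def __init__(self, recv_log, already_sent):
        self.recv_log = deque(recv_log)        # messages received after the checkpoint, in order
        self.already_sent = set(already_sent)  # ids of messages the pre-failure run sent
        self.outbox = []

    def receive(self):
        if self.recv_log:                  # still recovering: replay from the log
            return self.recv_log.popleft()
        return "NORMAL_RECEIVE"            # log exhausted: resume normal execution

    def send(self, msg_id, payload):
        if msg_id in self.already_sent:    # duplicate of a logged send: discard it
            self.already_sent.discard(msg_id)
            return
        self.outbox.append((msg_id, payload))

# p2 restarting from checkpoint C2,3 in Figure 2: m6 was already sent, m8 is replayed.
p2 = RecoveringProcess(recv_log=["m8-payload"], already_sent={"m6"})
p2.send("m6", "recomputed m6")          # discarded, not re-sent
assert p2.outbox == []
assert p2.receive() == "m8-payload"     # replayed from the log
assert p2.receive() == "NORMAL_RECEIVE"
```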
Independently of whether the message is logged synchronously or asynchronously, either the sender or the receiver of a message may log it [19], [11]. In receiver based message logging, the process receiving a message logs the message. If the message is logged synchronously, the process receiving the message is not allowed to continue, and the process that sent the message is not acknowledged, until the message is logged to stable storage. In the event of a failure, only the failing process or processes are required to roll back, because all the information the recovering process needs is in its most recent checkpoint and its message log. The recovering process cannot depend on any unlogged message because the original process was not allowed to continue until the message was logged. Another advantage of synchronous receiver based logging is that there is enough information on stable storage to tolerate any number of failures, including complete failure of the whole system. In addition, only the most recent checkpoint is needed to recover from a failure, so old checkpoints may be discarded. Performance can be improved by logging messages asynchronously at the cost of extra space for old checkpoints and the possibility of limited rollback propagation. Under sender based message logging, the process sending the message logs it [19]. Sender based message logging is most effective in conjunction with optimistic message logging. There is very little overhead during execution because the process sending a message does not have to wait for it to be logged, and it can be sent before it is logged. If only one processor fails at a time, the process running on the failed processor can use the message logs of surviving processes to recover.
Even if the contents of the failed processor's message log are lost, none of the surviving processes need to roll back because the messages needed for the failed process to recover are stored in the surviving processors' message logs. The failed processor's message log can be regenerated as the process re-executes. The ability to recover from single processor failures without forcing surviving processes to roll back contrasts with receiver based optimistic message logging, in which a failure might result in the loss of some messages that have not been completely written to stable storage. In that case some of the surviving processes will be forced to roll back along with the failing process. Based on the observation that the real goal in trying to prevent rollbacks is to reduce the amount of time required to recover from a failure while adding as little overhead as possible during failure free execution, techniques have been developed that bound the restart time and reduce the number of messages that need to be logged [30], [29], [28]. It is not necessary to log every message. It is sufficient to log only those messages that would cause rollback propagation, because unlogged messages can be recomputed on the fly by executing parts of other processes. Some checkpointing systems use this information to reduce the number of messages logged [30], [28] or to put a practical upper bound on the amount of time required to restart a process [29]. The advantage of this method is its low failure free overhead and its nearly bounded playback time. An alternative to allowing processes to checkpoint independently is to force them to coordinate when they checkpoint [8], [38], [31], [10]. Coordinated checkpointing causes the processes to synchronize so that there are no messages in transit, or so that in-transit messages can be included in the local checkpoints.
Coordination of all the processes participating in a large computation can take a significant amount of time depending on the relative speed of communication and computation, the checkpoint frequency, and the number of processes involved. Coordinated checkpointing has been shown to perform better than message logging for some systems, but independent checkpointing without message logging performed slightly better in most cases [10], [11]. However, these tests were run on a network of 16 workstations on an Ethernet, so it is unclear how the performance results will scale to larger systems running across a wide area network. A range of techniques for checkpointing message passing machines have been explored. Many of these ideas and concepts can be applied to checkpointing distributed shared memory systems.

3. Distributed Shared Memory

Distributed shared memory provides a shared memory programming abstraction on a network of machines, for which it is generally considered easier to write programs than for message passing models. If several processes in a message passing system need to share data, the programmer must explicitly send the data to each process. Writing a program to efficiently distribute the data among processes and keep each process updated can be difficult and tedious. In a DSM system, processes access shared data the same way they access regular memory. Changes to shared data are propagated by the DSM to the processes that need them.

Classification of DSM Systems

The ease of programming of a DSM system and the performance achieved by programs written for it are affected by a variety of features [27], [34], [7], [4], [21], [13].
Significant characteristics by which a DSM system can be classified are: the boundaries on which objects are shared; how the system determines which processor has the most recent copy of shared data; the memory consistency model of the system (i.e., the guarantees the system makes about how each processor sees updated versions of shared data); whether the system is implemented at the library or operating system level; and the amount of hardware support required. One major distinguishing characteristic of DSM systems is the basic unit of memory, or sharing unit, which is the smallest amount of memory shared between processors. Many DSM systems share data in logical units the size of a virtual memory page [27], [7], [4]. Most page based systems exploit the existing virtual memory hardware. In this case, the DSM is fairly transparent to the user's program. It is straightforward to port applications from hardware shared memory machines, and the systems are not tied to a specific programming language. The compiler does not have to be modified because the underlying DSM system manages sharing by manipulating each process's page table. Also, if the operating system supplies user level hooks to handle page faults, no changes to the operating system are required. One common problem of page based systems is false sharing. False sharing occurs when two shared data structures exist in the same page but are frequently modified by different processors: the page must be sent back and forth repeatedly between the two processors even though they are accessing unrelated variables. One way to overcome false sharing is to share data between processors on segment boundaries. In segment based DSM systems [21], [13], the user specifies the size of each region of shared memory. Segment based DSM systems can avoid false sharing because they know the boundaries of each segment.
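The cost of false sharing under page-granularity, single-writer sharing can be illustrated with a small transfer counter. The page size, variable layout, and node names here are assumptions for the example, not taken from any particular system:

```python
PAGE_SIZE = 4096

def count_page_transfers(writes, layout):
    """Count page migrations under single-writer, page-granularity sharing.

    writes: sequence of (node, variable) write operations.
    layout: variable -> byte address in the shared address space.
    """
    owner = {}       # page number -> node that last wrote it
    transfers = 0
    for node, var in writes:
        page = layout[var] // PAGE_SIZE
        if owner.get(page, node) != node:
            transfers += 1               # the page must migrate to the writer
        owner[page] = node
    return transfers

writes = [("n1", "a"), ("n2", "b")] * 3   # two nodes updating unrelated variables

# Same page: every write after the first forces a page transfer (false sharing).
assert count_page_transfers(writes, {"a": 0, "b": 8}) == 5
# Variables on separate pages (as with per-variable segments): no transfers.
assert count_page_transfers(writes, {"a": 0, "b": 8192}) == 0
```

Placing the two variables in distinct segments, as a segment based DSM would, removes the ping-ponging entirely.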
Generally, segmented systems cannot use the virtual memory hardware in the processor, which handles pages of fixed size. Therefore, some form of annotation is required in the source code of a program written for a segment based system to indicate the beginning and ending of accesses to shared memory. Object based sharing takes the segment idea to the extreme by allocating each object in an object oriented language its own segment and letting the compiler manage the shared data segments. Unfortunately, this ties programs for an object based DSM to a particular compiler. Another distinguishing characteristic of DSM systems is how each node finds shared data. In a DSM system, when a node needs shared data, it must determine where to look for it. For page based DSM systems, several ways have been devised to provide page table information through managers [27]. The simplest scheme is the centralized manager. One node contains all the information about which nodes have copies of each page and the state of the page (e.g., read-only, read-write, invalid). However, the node that contains the central manager can quickly become a bottleneck. The fixed distributed manager scheme assigns the status information for each page to a particular node. The manager of a page does not change and can be determined from the page number. A node can always find a page by asking at most one other node. This is an improvement over the centralized manager because the page handling is distributed over all of the nodes in the system. It does not, however, account for the usage patterns of the program. It is common for a group of nodes to pass a page among themselves throughout the course of a program. The manager node may never use one of the pages it manages, but it must deal with the overhead of managing that page. The manager may also be relatively distant on the network from a group of nodes that use a page it manages, causing delays for the nodes using the page.
In either case it is desirable for a node that actually uses the page to take care of it. Another approach, called the dynamic distributed manager, does not assign a fixed node to each page [27]. Instead each node keeps track of the probable owner of a page. When a node needs to change the access level of a page from read-only to read-write or from invalid to read-only, it requests the page from the page's probable owner. If the probable owner of a page is not the actual owner, it forwards the request to the node it considers the probable owner. This protocol guarantees that the real owner will eventually be found by following the trail of probable owners. When a node refers another node to the probable owner of a page, it resets its own record of the probable owner to the requesting node to shorten future searches. In the worst case more messages may be sent to find a page, but if a page is only shared by a few nodes, fewer messages will likely be required on average [27]. DSM systems also vary in the way in which processors see changes to memory. Memory consistency is the policy that determines how and when changes made by one processor will be seen by other processors in the system. The most intuitive policy is sequential consistency [26]. Under sequential consistency, the reads and writes of each processor must be seen by all processors in the order specified by the program it is running, and writes by one processor are seen by subsequent reads from all other processors. For example, assuming that x and y are initially 0, the program in Figure 4 consists of two processes, p1 and p2. If they are part of a program running on a sequentially consistent system, p1 and p2 will print one of the following: "y = 0, x = 5", "x = 0, y = 8", "x = 5, y = 8", or "y = 8, x = 5". However, "x = 0, y = 0" could never happen under sequential consistency because "x = 0" implies that y has already been assigned 8.
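The probable-owner chase, including the hint update described above, can be sketched as follows. The node names and the hint table are illustrative, not drawn from any cited implementation; one hypothetical simplification is that the whole chase runs in a single function rather than as forwarded messages. (For comparison, the fixed distributed manager would simply compute the manager as `page_number % num_nodes`.)

```python
def find_owner(requester, probable_owner):
    """Follow the probable-owner chain to the real owner (dynamic distributed manager).

    probable_owner: dict mapping each node to the node it believes owns the page;
    the real owner points at itself. Each forwarder resets its hint to the
    requesting node, shortening future searches.
    """
    hops = 0
    node = probable_owner[requester]
    while probable_owner[node] != node:      # not the real owner: forward the request
        nxt = probable_owner[node]
        probable_owner[node] = requester     # update the hint for future searches
        node = nxt
        hops += 1
    return node, hops

hints = {"n1": "n2", "n2": "n3", "n3": "n4", "n4": "n4"}   # n4 really owns the page
owner, hops = find_owner("n1", hints)
assert owner == "n4" and hops == 2
assert hints["n2"] == "n1" and hints["n3"] == "n1"   # forwarders now point at the requester
```

The termination guarantee in the text corresponds to the loop's fixed point: the chain of hints always ends at a node that believes it owns the page.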
Note that this is the same result as if the two processes were running on a single processor multi-tasking system.

Figure 4. A simple program with two processes: p1 executes "x = 5; print y" while p2 executes "y = 8; print x".

Sequential consistency is intuitive, but it is inefficient to implement because it imposes consistency requirements that are more restrictive than necessary. Relaxed consistency models require the programmer to annotate regions of a program using shared data, but they can be implemented efficiently. These annotations can be integrated into the synchronization methods of the DSM system that the programmer must use to prevent data races, regardless of the consistency model. When several processes in a program try to access the same data without any synchronization (i.e., locks, semaphores, barriers, etc.), as in Figure 4, the result is unpredictable because there is no way to determine the order in which processes will access the data. In most programs this situation, called a data race, is undesirable regardless of the consistency model. To avoid data races, accesses to shared data are surrounded by some form of synchronization. Relaxed consistency memory models take advantage of this by only guaranteeing that memory will be consistent if a program uses appropriate synchronization (i.e., contains no data races). This reduces the frequency with which messages must be exchanged to notify other processors that the value of some shared data has changed. In addition, relaxed consistency eases false sharing by allowing multiple processes to write to a page at the same time. Each process keeps track of the parts of a page it changed and distributes its changes, rather than the whole page, as part of the consistency protocol [4]. A number of relaxed consistency models have been proposed [1], [23], [24], [2], [22], [3], [14].
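The claim that "x = 0, y = 0" is impossible can be checked mechanically by enumerating every sequentially consistent interleaving of the two processes in Figure 4. This is an illustrative sketch, not part of any cited system:

```python
def interleavings(a, b):
    """Yield every interleaving of two instruction sequences, each kept in program order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

# Figure 4, with x and y initially 0: p1 runs "x = 5; print y", p2 runs "y = 8; print x".
p1 = [("write", "x", 5), ("read", "y")]
p2 = [("write", "y", 8), ("read", "x")]

def run(schedule):
    mem, printed = {"x": 0, "y": 0}, {}
    for step in schedule:
        if step[0] == "write":
            mem[step[1]] = step[2]
        else:
            printed[step[1]] = mem[step[1]]
    return printed["y"], printed["x"]   # (value p1 prints, value p2 prints)

outcomes = {run(s) for s in interleavings(p1, p2)}
assert (0, 5) in outcomes and (8, 0) in outcomes and (8, 5) in outcomes
assert (0, 0) not in outcomes   # impossible under sequential consistency
```

Because each process writes before it reads, every interleaving places at least one write before the other process's read, which is exactly why (0, 0) never appears.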
Release consistency, one of the more popular varieties of relaxed consistency, differentiates between accesses to synchronization variables and accesses to shared data. Under release consistency, synchronization variables are accessed with acquires and releases, which can be used to build locks, barriers, and semaphores. Writes to data variables are only guaranteed to be seen by other processors after the release of a synchronization variable. If all accesses to shared data are surrounded with some form of synchronization (i.e., the program has no data races), then this is indistinguishable from sequential consistency. If the program does have a data race, then its behavior is unpredictable. This is generally not a problem because most programs do not contain data races. Memory consistency models can be implemented by one of two types of protocols. An update protocol broadcasts copies of changed data to all nodes that have a copy of the data whenever the memory consistency model requires other nodes to be made aware of the changes. Invalidate protocols tell nodes with copies of shared data to invalidate their copies. When a node needs to access data that has been invalidated, it must request it from a node that has a valid copy. Some DSM systems are implemented as a library on top of an existing operating system. Others are integrated with the operating system kernel. Implementation as a library that is compatible with a common operating system such as Unix allows portability across many platforms, but integration into the operating system kernel provides opportunities for performance improvements. For example, many page based DSM systems use user level Unix system calls to set up routines that handle page faults [4]. Handling page faults with user code is slower because a context switch is required. Some DSM systems compromise and require only slight modifications to an existing operating system, while most of the system is implemented as a library.
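A minimal sketch of the invalidate protocol described above follows. The copyset bookkeeping is illustrative and omits real message passing, but it shows why a write forces every other cached copy of the page to be dropped:

```python
class InvalidateDSM:
    """Toy invalidate protocol: before a write becomes visible, every other
    cached copy of the page is invalidated (here, removed from the copyset)."""
    def __init__(self):
        self.value = {}     # page -> current value
        self.copyset = {}   # page -> set of nodes holding a valid cached copy

    def read(self, node, page):
        self.copyset.setdefault(page, set()).add(node)   # node fetches a valid copy
        return self.value.get(page)

    def write(self, node, page, value):
        stale = self.copyset.get(page, set()) - {node}
        self.copyset[page] = {node}     # the writer holds the only valid copy
        self.value[page] = value
        return stale                    # a real system would send these nodes invalidations

dsm = InvalidateDSM()
dsm.write("n1", 0, "v0")
dsm.read("n2", 0)
dsm.read("n3", 0)
assert dsm.write("n1", 0, "v1") == {"n2", "n3"}   # both cached copies must be dropped
assert dsm.read("n2", 0) == "v1"                  # n2 re-fetches the fresh value
```

An update protocol would instead push the new value to every node in the copyset, trading larger messages for fewer subsequent misses.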
The amount of hardware required to support DSM also varies. Most page based DSM systems use the virtual memory facilities built into the processors on which they run to remap memory, to trap on writes to read-only pages, and to retrieve pages that the processor does not have in physical memory. Most segmented systems do not require any special hardware support [21]. Accesses to shared data must be annotated in the source code, or, in the case of a system that has a special compiler, the compiler provides extra code to access shared data. Some classify hardware based systems such as DASH as hardware DSM systems [34].

Example DSM Systems

A number of DSM systems have been developed that use the above ideas. One of the first DSM systems was IVY [27]. It showed that shared memory could be implemented on a distributed memory machine. IVY was a page based system implementing sequential consistency. It was primarily implemented as a user level library, but required some operating system modifications to manage the page tables. The only hardware IVY required was the processor's built-in virtual memory hardware. IVY introduced the centralized manager, fixed distributed manager, and dynamic distributed manager schemes for handling shared pages. TreadMarks was designed to provide a portable, efficient distributed shared memory system. TreadMarks is a page based DSM implementing lazy release consistency [4]. It is implemented as a user level Unix library. It uses the standard built-in virtual memory hardware through operating system calls available in most implementations of Unix. TreadMarks uses a fixed distributed manager to keep track of page ownership [24]. The C Region Library (CRL) was designed to provide a portable DSM system that does not require any hardware support. It is a segment based DSM implemented as a user library [21]. Each access to a shared variable is surrounded with markers that denote the beginning and end of a region where the variable will be accessed.
The consistency model is similar to entry or release consistency in that each shared variable may only be accessed within a region. However, there are no explicit synchronization variables; the markers at the beginning and end of a region need only mention the variable's name. Each shared segment has a fixed home node, not unlike the fixed distributed manager algorithm for page based systems. CRL has versions that run on the Alewife machine, CM-5 computers, and PVM with Unix. Applications running on the Alewife version of CRL have been shown to perform competitively with the same applications using hardware shared memory on Alewife. Unify is a scalable DSM capable of linking hundreds or thousands of high-performance machines in geographically distant locations [13]. Unify supports shared memory abstractions and mechanisms that mask the distribution of resources, reduce the frequency of communication and the amount of data transferred, hide the propagation latencies typical of large-scale networks (e.g., by overlapping communication with computation), and support large-scale concurrency via synchronization and consistency primitives free of serial bottlenecks. Unify provides a convenient data sharing abstraction with performance rivaling scalable message passing systems [40], [12]. Unify is a segment based DSM supporting three basic memory abstractions for shared data. Random access memory is directly addressable, much like in other DSM systems. Sequential access memory is accessed in a read/front, write/append fashion. Associative memory is accessed via key/value pairs. Sequential access memory and associative memory make use of a new memory consistency dimension called spatial consistency. Spatial consistency determines the relative order of the contents of the replicas of a segment. For many distributed applications that use keyed lookups or sequential access, the order of the data items within a segment is unimportant; only the values of individual data items matter.
Spatial consistency allows efficient implementation of such applications. In addition to its spatial consistency model, each segment can have any of a number of different temporal memory consistency models. Unify supports a set of consistency management primitives that allow an application to select the appropriate consistency semantics from a spectrum of consistency protocols, including automatic methods, where the operating system enforces consistency, and application-aided methods, where the user defines consistency. In addition, Unify supports weak automatic methods where the memory becomes consistent after some time lag T. This type of automatic consistency is particularly useful for applications that can detect stale data, such as Grapevine [5]. To support scalability, the concept of "sharing domains" is introduced, and event counts and sequencers are used rather than locks, semaphores, or barriers for synchronization. For a large class of applications, event counts and sequencers can result in reduced communication and greater concurrency. Also, to exploit localized sharing and communication, a Unify user can partition the set of hosts into sharing domains. Each sharing domain uses a separate multicast group to reduce the cost of intra-domain information sharing. Sharing domains distribute the burden of information retrieval and distribution by allowing any member of the domain to issue or answer inter-domain requests (addressed to multicast groups).

4. Checkpointing DSM

The goal of checkpointing a distributed system is to save a consistent global state from which the computation can be restored in the event of a failure. For checkpointing DSM systems, the meaning of a consistent global state changes depending on the memory consistency model of the system. As with message passing systems, the two main approaches to checkpointing DSM systems are coordinated checkpointing and independent checkpointing.
Coordinated checkpointing can be used for DSM in a way similar to coordinated checkpointing for message passing. All processes synchronize and record their states as well as in-transit messages. As with message passing systems, only one checkpoint must be stored and there is no extra overhead for logging messages. To recover from a failure, all processes roll back to the most recent checkpoint. This type of coordinated checkpointing can be improved by taking advantage of the nature of DSM systems to reduce checkpointing overhead. One of the overheads of checkpointing is the time required to write data to stable storage. Rather than storing the state of shared memory to stable storage, which can be time consuming, one coordinated checkpointing scheme guarantees that each page is replicated on at least two processors during a checkpoint [25]. If one processor fails, the pages stored on that processor will have been replicated in at least one other processor's memory. This technique cannot tolerate more than one processor failure at a time. Experimental results show that as the number of processors in the system increases, the overhead due to replicating pages decreases. In some cases the extra replication of a page can even improve an application's performance, because some pages will be fetched before they are needed; however, this depends on the program's behavior.

Unfortunately, as with message passing systems, coordinating all of the processors to take a checkpoint can incur considerable overhead. For programs that frequently use barriers across all processors for synchronization, the checkpointing system can wait until a barrier to take a checkpoint [6]. When all processes are waiting at a barrier, but before any of them leave, the state of each process is saved. The state of the system is guaranteed to be consistent because all processes are waiting at the barrier.
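The barrier-based scheme can be sketched with threads standing in for processes. This is a sketch under assumptions, not the scheme of [6]: the names and the in-memory checkpoint store are illustrative, and saving between two barrier phases models "all processes have arrived, none has left":

```python
# Sketch of barrier-based coordinated checkpointing. Each worker saves
# its local state after every process has reached the barrier (phase 1)
# and before any process proceeds past it (phase 2), so the saved
# global state is consistent.
import pickle
import threading

N = 4
barrier = threading.Barrier(N)
checkpoints = {}          # rank -> serialized local state (stable storage stand-in)

def worker(rank, state):
    state["step"] = state.get("step", 0) + 1   # some local computation
    barrier.wait()                             # phase 1: everyone arrives
    checkpoints[rank] = pickle.dumps(state)    # save while all are stopped
    barrier.wait()                             # phase 2: nobody leaves early

threads = [threading.Thread(target=worker, args=(r, {"rank": r}))
           for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(checkpoints))   # -> [0, 1, 2, 3]
```

The second `barrier.wait()` is the key design choice: without it, a fast process could leave the barrier and modify shared state before a slow process finished saving, breaking consistency.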
Though this works well for programs that use barriers regularly, not all programs use barriers, and barriers are expensive in large scale systems. Independent checkpointing techniques have the advantage that no costly coordination is required. Given that distributed shared memory systems are built on an underlying message passing facility, one approach to independent checkpointing for DSM is to apply one of the previously described message passing checkpointing methods directly to the underlying messages. However, DSM systems tend to send significantly more messages than message passing systems, and many of those messages do not cause dependencies between processors [17]. Recoverable DSM systems have been developed that reduce the amount of tracking by further reducing the number of messages that constitute data dependencies [18]. This system uses a variation of the fixed distributed manager protocol which prevents request messages to the page manager, and messages from the page manager to the owner of a page, from becoming dependencies. Reducing the number of dependencies not only reduces the amount of information that needs to be tracked, but also reduces the probability of rollback propagation. In message passing systems, rollback propagation can be eliminated by logging messages. During recovery, messages are replayed from the message log when the program executes a receive call. However, there are no explicit receive calls in a DSM system; page update and invalidate messages arrive at a processor at unpredictable times. A straightforward solution to the problem of unpredictable update or invalidate message arrivals in a sequentially consistent system is to checkpoint whenever a write to shared data becomes available to another processor [43]. No processor will see any shared data which is not stored in the most recent checkpoint of the processor that wrote it, as shown in figure 5.
In the figure, each read or write to a shared variable is denoted with an R or a W, respectively. Assume that initially the value of a variable, x, resides on the same processor as process p1. When process p1 receives a request for the page containing x, it saves a checkpoint. Likewise, when p2 receives the request for the page containing x, it must also checkpoint. Note that when p3 gets a request for the page containing y, it does not have to checkpoint because it has not modified any pages. This method, called communication induced checkpointing, avoids the need for processes to coordinate. In the event of a failure, failed processes roll back to their most recent checkpoint. No other processes need to roll back because the checkpoints break the dependencies between processes. The checkpointing frequency of communication induced checkpointing depends on the rate at which pages are shared between processors. Many programs share data frequently, making communication induced checkpointing inefficient.

[Figure 5: reads (R) and writes (W) of shared variables x and y by processes p1, p2, and p3; requests for the page containing x trigger checkpoints on p1 and p2.]
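Communication induced checkpointing can be sketched as follows, assuming a page-based DSM in which a processor checkpoints before serving a request for a page it has modified since its last checkpoint; all class and method names here are illustrative:

```python
# Sketch of communication induced checkpointing for a page-based DSM.
# A processor checkpoints before any of its writes becomes visible to
# another processor, i.e. before it serves a page request while holding
# modified (dirty) pages. A processor with no modified pages, like p3
# in the figure, replies without checkpointing.
import copy

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.pages = {}        # page name -> value
        self.dirty = set()     # pages modified since the last checkpoint
        self.checkpoints = []  # stand-in for stable storage

    def write(self, page, value):
        self.pages[page] = value
        self.dirty.add(page)

    def serve_request(self, page):
        # Checkpoint only when handing out data not yet covered by a
        # checkpoint; this breaks the dependency on the requester.
        if self.dirty:
            self.checkpoints.append(copy.deepcopy(self.pages))
            self.dirty.clear()
        return self.pages[page]

p1, p3 = Processor(1), Processor(3)
p1.write("x", 7)
p1.serve_request("x")          # p1 checkpoints: x was modified
p3.pages["y"] = 0              # p3 holds y but has not modified it
p3.serve_request("y")          # no checkpoint needed
print(len(p1.checkpoints), len(p3.checkpoints))   # -> 1 0
```

The sketch also makes the cost visible: every transfer of a dirty page forces a checkpoint, which is why programs that share data frequently make this approach inefficient.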